American Studies Seminar / Primary Source Write-Up

Author

Emily Zou

Published

October 7, 2023

Project Updates

If you remember where I was last week, you’ll see that my project has once again changed dramatically.

I was broadly interested in how unique features of online communities affect their members’ reactions and/or resistance to broader changes– specifically, social movements. I was particularly intrigued by whether approaching online behavior as a network of communities, rather than an amalgamation of individuals, could better explain why some movements succeed and others fail.

While fleshing out this idea by cleaning up a dataset of YouTube comments from Hasan Piker’s video on Hogwarts Legacy, however, I noticed something else. Many of the keywords pulled from the comments used a distinct sort of language to make their point. For instance, one commenter wrote:

“He was born under the sign of the cuck, but his Redditor side is in retrograde.”

This may read as nonsense to some, but it is a commonplace statement in online communities, particularly gaming and streaming spaces that regularly use ‘cucks’ and ‘Redditors’ as derogatory descriptors. While it is certainly silly, I also see it as an example of how gaming and streaming slang has expanded to describe and make meaning outside of its original topics. I became very curious about the nature of this sort of communication.

Brief Context

Statistically, we know that 82% of American adults get some of their news digitally. Anecdotally, I know that many people get their news through streamer and gamer personalities, similar to how we are more likely to keep up with Evanston and Illinois news than with California’s or China’s. I wonder if the structure of our mental models of news– the events, controversies, celebrities, and policies that we are aware of– shapes how we engage with and interpret news, the sustained flow of information. In this case, I am interested in how the communication tools that users acquire through participating in online communities and digital subcultures are being employed to understand and discuss American news– or, more accurately to this project, news that is not strictly about gaming and streaming. This sort of work– which emphasizes the networked level of online spaces– would draw on social computing work on online ecologies to reevaluate existing research on media literacy, misinformation, and digital polarization.

Data Background

In any case, what I did here was test my intuitions to see what I’m actually working with. Putting Hasan Piker’s channel aside for a moment, I took up two streamers, primarily gamers, who I know also cover news on at least a weekly basis:

MoistCritikal (Charlie White) has 13.8 million subscribers on YouTube and, before switching to YouTube streaming in September, was consistently one of Twitch’s top streamers. Other than posting about video games and gaming drama, his content ranges from U.S. politics to anime reviews to ranking Greek gods.

Mogul Mail is the “news channel” for online streamer Ludwig Ahgren, who now streams on YouTube to 5.41 million subscribers but previously gained attention through Twitch. He notably broke the previous record for most paid subscribers on Twitch after streaming for 31 days straight in 2021.

I collected Ludwig and Charlie’s videos from the past year to find topics they both covered:

1. Elon Musk buys Twitter

2. Andrew Tate

3. US releases UFO report

4. Unity (gaming company) releases new policies

Data Collection

The reason I’m including the process in an assignment ostensibly just about the source is that my later research and analysis will rest on the means by which I got my data. I’m not studying the textual content of these YouTube videos and comments (for now), but rather the broader patterns that their thousands of comments reveal.

I’ll start with the Twitter one here. I used YouTube Data Tools to collect all the comments left under three videos:

1: “Elon Musk’s Worst Idea” from Mogul Mail (Ludwig Ahgren), posted on November 2, 2022

2: “Elon Musk’s Worst Idea” from penguinz0 (Charlie White), posted on November 2, 2022

3: “See how Elon Musk is responding to mass Twitter employee resignations” from CNN, posted on November 18, 2022

The first two are obviously very similar– but I would like to refine and/or expand how I choose my ‘non-gamer’ news sources in the future. I chose an American mainstream news channel that covered the Elon Musk/Twitter event at the beginning of November last year. I found that mainstream news channels post a lot, every day, and their videos tend to be more specific. This makes sense, as they have a much larger team pushing out content daily, compared to a single streamer. CNN’s angle– specifically on Musk’s response to employees– doesn’t serve me well here if I were to make comparisons without considering this context. However, knowing this is interesting in itself: people can get their news from broad informal summaries rather than the typical news format.
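To keep that volume gap concrete before any language comparison, here is a minimal sketch of the kind of sanity check I have in mind. The sizes below are made up for illustration; the real counts come from the exported CSVs.

```python
import pandas as pd

# Toy stand-ins for the three comment exports; the sizes are
# invented for illustration, not the actual counts.
da = pd.DataFrame({"text": ["cnn comment"] * 500})      # CNN
db = pd.DataFrame({"text": ["ludwig comment"] * 3000})  # Mogul Mail
dc = pd.DataFrame({"text": ["moist comment"] * 8000})   # MoistCritikal

# Raw comment volume per source, before any filtering or plotting.
counts = {"cnn": len(da), "ludwig": len(db), "moist": len(dc)}
print(counts)
```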

import pytextrank, spacy
import scattertext as st
import numpy as np
import pandas as pd
from scattertext import SampleCorpora, produce_scattertext_explorer
from scattertext import produce_scattertext_html
from scattertext.CorpusFromPandas import CorpusFromPandas
import IPython
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank", last=True)
da = pd.read_csv('/Users/emilyzou/Desktop/gearthird/cnncommentvideo.csv')
db = pd.read_csv('/Users/emilyzou/Desktop/gearthird/ludwigelonmuskworstideacomments.csv')
dc = pd.read_csv('/Users/emilyzou/Desktop/gearthird/moistelonmuskworstideacomments.csv')

# Label each dataframe with its source channel.
da['vod'] = 'cnn'
db['vod'] = 'ludwig'
dc['vod'] = 'moist'

nami = pd.concat([da,db], ignore_index = True)
sanji = pd.concat([db, dc], ignore_index = True)
zoro = pd.concat([da,dc], ignore_index = True)

After reading in our three datasets (Mogul Mail, MoistCritikal, CNN), I made three new dataframes to work with: since we’re comparing language across two sources at a time, there is one dataframe for each pairing. Then, we have to do some preprocessing, filtering out words that aren’t going to be helpful for us.

In the future, I’d like to be more careful with this part and spend more time on it. What I did here is probably not optimal, and also not replicable if I end up doing this with more videos rather than manually. In any case, I first ran all the code and looked through the final graphs, making a list of words that I didn’t think were very informative. I put them in a list called “stopward” to filter out. You can see how this is cheating a bit.

stopward = ['people', 'like', 'would', 'even', 'think', 'get', 'one', 'also', 'br', 'said', 'say', 'thing', 'still', 'could', 'amp', 'gon', 'non', 'something', 'actually', 'go', 'though', 'anyone', 'bye', 'who', 'till', 'https amp', 'https', 'its', 'quot', 'jor']

Then, I made some more general functions for filtering. There is a nice list of the most common words in the English language, such as ‘as’ or ‘of’ or ‘and’, that I used, and I also got rid of numbers and words of two characters or fewer.

def tokenize(tea):
    return nltk.tokenize.word_tokenize(tea)

nami['tokens'] = nami['text'].apply(tokenize)
sanji['tokens'] = sanji['text'].apply(tokenize)
zoro['tokens'] = zoro['text'].apply(tokenize)

def garp(tokens):
    # Keep alphabetic tokens longer than two characters, then drop
    # the custom stopward list and NLTK's English stopwords.
    lista = [i for i in tokens if i.isalpha()]
    liste = [i for i in lista if len(i) > 2]
    listf = [i for i in liste if i not in stopward]
    return [i for i in listf if i not in stopwords.words('english')]

nami['tony'] = nami['tokens'].apply(garp)
sanji['tony'] = sanji['tokens'].apply(garp)
zoro['tony'] = zoro['tokens'].apply(garp)

def backstring(tokens):
    # Join the filtered tokens back into one string for spaCy parsing.
    return ' '.join(str(x) for x in tokens)

nami['tony'] = nami['tony'].apply(backstring)
sanji['tony'] = sanji['tony'].apply(backstring)
zoro['tony'] = zoro['tony'].apply(backstring)

nami['parsed'] = nami['tony'].apply(nlp)
sanji['parsed'] = sanji['tony'].apply(nlp)
zoro['parsed'] = zoro['tony'].apply(nlp)

corpus = st.CorpusFromParsedDocuments(nami, category_col='vod', parsed_col='parsed').build()
corpus1 = st.CorpusFromParsedDocuments(sanji, category_col='vod', parsed_col='parsed').build()
corpus2 = st.CorpusFromParsedDocuments(zoro, category_col='vod', parsed_col='parsed').build()

The Plots

CNN / Ludwig

The caveat with the CNN video topic is apparent here– which is something we will tune for later. Data points such as “employees” or “work” don’t really tell us much. This could be fixed with more corpora of data. Even so, we can catch glimpses of the different languages: on “Ludwig’s side”, we see more acronyms/slang such as “imo” and “ahh.” The overall topic differences are also expected, such as the unique usage of words like “streamer”, “chat”, and “cringe.”

html = produce_scattertext_explorer(corpus,
                                    category='cnn',
                                    category_name='cnn',
                                    not_category_name='lud',
                                    width_in_pixels=1000,
                                    minimum_term_frequency=5)

file_name = 'mogulsvscnn.html'
#open(file_name, 'wb').write(html.encode('utf-8'))
#IPython.display.HTML(filename=file_name)
from IPython.display import IFrame
IFrame(src='https://emzou.github.io/mogulsvscnn.html', width=1000, height=600)

Ludwig / Charlie

At first glance, the difference in plot shapes is obvious: compared to CNN/Ludwig, this plot is much flatter, meaning that the language across the two documents was more similar. Of course, the videos were released on the same day with the same name, but the common terms are suggestive: “zuck”, “boomer”, “core”. These streamer-to-streamer comparisons would be useful in identifying language and terms that are commonly used across different channels. Also important to note: in the future, I would want to equalize the document and word counts.

html = produce_scattertext_explorer(corpus1,
                                    category='moist',
                                    category_name='moist',
                                    not_category_name='lud',
                                    width_in_pixels=1000,
                                    minimum_term_frequency=5)

file_name = 'mogulsvsmoist.html'
#open(file_name, 'wb').write(html.encode('utf-8'))
#IPython.display.HTML(filename=file_name)

from IPython.display import IFrame
IFrame(src='https://emzou.github.io/mogulsvsmoist.html', width=1000, height=600)
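The equalization mentioned above could start with simple downsampling: cap every source at the size of the smallest one before building a corpus. A minimal sketch with toy dataframes (the real inputs would be the paired comment dataframes loaded earlier):

```python
import pandas as pd

# Toy stand-ins for two unequally sized comment dataframes.
lud = pd.DataFrame({"text": [f"lud comment {i}" for i in range(300)], "vod": "ludwig"})
moist = pd.DataFrame({"text": [f"moist comment {i}" for i in range(1200)], "vod": "moist"})

# Downsample each source to the size of the smallest, for a fairer comparison.
n = min(len(lud), len(moist))
balanced = pd.concat(
    [df.sample(n=n, random_state=0) for df in (lud, moist)],
    ignore_index=True,
)
print(balanced["vod"].value_counts().to_dict())
```

This only equalizes document counts, not word counts per comment, so it is a first step rather than a full fix.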

CNN / Charlie

This plot is the most polarized of the three– the comments on MoistCritikal’s side are much more explicit. Terms like “dickriding”, “clout”, and “amogi” are all unique, compared to CNN’s comments, which were more likely to mention proper nouns.

html = produce_scattertext_explorer(corpus2,
                                    category='cnn',
                                    category_name='cnn',
                                    not_category_name='moist',
                                    width_in_pixels=1000,
                                    minimum_term_frequency=5)

file_name = 'cnnvsmoist.html'
#open(file_name, 'wb').write(html.encode('utf-8'))
#IPython.display.HTML(filename=file_name)

from IPython.display import IFrame
IFrame(src='https://emzou.github.io/cnnvsmoist.html', width=1000, height=600)

Future Work / Plans

These plots were fun to make, but not terribly informative by themselves– we can get a broad idea of where things stand, but how language is actually used will require much more investigation. Other than the problems I already raised when selecting these sources (difference in topic, difference in comment volume), I also want to spend more time thinking about the implications of which channels I select from.

In future work, other than refining the process overall, I also plan to actually sample the comments for what they say and how terms are used– which is much harder to do computationally (at least as far as I am aware). When it comes to informal gaming/streaming slang, I also need to be careful with how I represent what terms mean. Terms like “omegalul” and “kekw” have never been formally defined, so it is difficult to safely argue that a user meant one thing or another when using them.
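A simple first step toward that qualitative sampling, sketched here with a toy dataframe and a hypothetical helper (not part of the pipeline above), is to pull a few random comments containing a given slang term for close reading:

```python
import pandas as pd

# Toy comment dataframe; in practice this would be one of the collected CSVs.
comments = pd.DataFrame({"text": [
    "omegalul he actually said that",
    "this is peak boomer content",
    "kekw the chat went wild",
    "pretty normal take imo",
    "OMEGALUL every time",
]})

def sample_term(df, term, n=2, seed=0):
    # Up to n random comments whose text contains `term`, case-insensitive.
    hits = df[df["text"].str.contains(term, case=False, na=False)]
    return hits.sample(n=min(n, len(hits)), random_state=seed)

picked = sample_term(comments, "omegalul")
print(picked["text"].tolist())
```

Reading sampled comments in context like this would at least ground any claims about what a term means in actual usage, rather than a dictionary definition that doesn’t exist.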